Lab:
Evaluating PROGRESA

Author

Edward Vytlacil

Published

June 22, 2022

Background

Poverty is one of the most important social and economic problems in the world, and many remedies have been tried over the centuries. For example, in 1601, the Parliament of England passed the Poor Relief Act, distributing money, food, and clothing, and providing work and apprenticeships to the poor. In 1964, the US Government created the Food Stamp program, providing food-purchasing assistance for low- and no-income people living in the U.S.

Such programs were criticized for neither stimulating recipients to invest in human capital nor improving children’s educational outcomes. To address these criticisms, many countries implemented conditional cash transfer (CCT) programs, in which the government transfers money only to low-income persons who meet certain criteria, such as enrolling their children in public school, going to the doctor regularly, and receiving vaccinations. The most famous CCT program was implemented in Mexico in 1997 and was initially called PROGRESA (its current name is Oportunidades).

Scientifically, PROGRESA is very influential because it was combined with a large-scale randomized controlled trial (RCT). In 1997, the Mexican government and the International Food Policy Research Institute defined which villages were eligible to receive this CCT program, and randomly picked some villages to begin receiving the transfers in 1998 (the treated group) and other villages to begin receiving the transfers only in 1999 (the control group). The experimental results were very positive with respect to improving families’ social, educational, and economic outcomes.

See Skoufias et al. (2001) and
Schultz (2004) for more details on PROGRESA and the experimental results.

These positive results led to a rapid expansion of PROGRESA, which covered 2.6 million Mexican households in 50,000 villages by 2001. In addition, the success of PROGRESA stimulated many other countries to create similar programs (Bangladesh, Brazil, Cambodia, Chile, Colombia, Egypt, Guatemala, Honduras, Indonesia, Jamaica, Nicaragua, Panama, Peru, the Philippines, Turkey, and the United States).

See e.g. Attanasio et al. (2010) for Colombia.

In this lab, we will look at the effect of PROGRESA on school enrollment using the data from the randomized controlled trial. Here is a quick summary of the program:

  1. Eligibility for PROGRESA: Only poor households living in poor, rural villages were eligible.
  2. PROGRESA Treatment: Cash transfer to the mother of each household conditional on regular school attendance between 3rd and 9th grades for children less than 18 years old. The subsidy was slightly higher in 7th to 9th grade than in earlier grades to offset higher opportunity cost for older children, and slightly higher for girls than for boys in 7th through 9th grades to promote gender equality in schooling.
Education Level of Student   Monthly Payment
3rd Year                      70
4th Year                      80
5th Year                     105
6th Year                     135
7th Year, Boys               200
7th Year, Girls              210
8th Year, Boys               210
8th Year, Girls              235
9th Year, Boys               225
9th Year, Girls              255
  3. RCT of PROGRESA: Approximately two-thirds of 495 rural villages were randomly assigned to be treated villages, and all eligible (sufficiently poor) households in those villages were treated. The remaining villages were randomly assigned to be control villages, and no households in those villages were treated.

flowchart LR
  A{495 Villages} --> B{314 <br>  Treatment<br> Villages}
  A --> C{181  <br> Control<br> Villages }
  B --> D[Eligible  Children <br> Treated]
  B --> E(Noneligible Children, <br> Not Treated)
  C --> F[Eligible  Children <br> Controls]
  C --> G(Noneligible Children, <br> Not Treated)

  4. Survey Waves: Five surveys were conducted as part of the RCT. The first two survey waves, conducted in October 1997 and March 1998, were “baseline” surveys, i.e., were conducted before any household was treated. The last three survey waves were conducted in October 1998, May 1999, and November 1999, after randomization and after the treatment began. Note that the academic year in Mexico runs from late August to early July.

  5. Outcome: The outcome we will focus on is school enrollment, though these surveys contain a wealth of other information.

Data

As a first step, download the data set PROGRESA.csv. Alternatively, you can read the data set directly into R, as done below. This data is from Lee and Shaikh (2014), as posted at the Journal of Applied Econometrics Data Archive, and is only a small part of the full PROGRESA data set; see (IFPRI) (2018). Also download and read readme.txt, which describes the data.

Load Libraries

Start by loading the following libraries. Install them first if you have not already done so.

Library        Purpose
modelsummary   Package by Arel-Bundock (2022) for creating publication-ready tables from regression models, data frames, and summary statistics. Supports LaTeX and HTML output.
ggplot2        Package by Wickham (2016) for creating figures. More complicated to use than the graphing options in base R, but much more powerful and flexible.
plotly         Package by Sievert (2020) for creating interactive figures.
# install packages (only needed once; skip if already installed)
install.packages("modelsummary")
install.packages("ggplot2")
install.packages("plotly")
# load packages (make sure they are already installed)
library(modelsummary)
library(ggplot2)
library(plotly)

Instructions for installing and loading R libraries include:

Other Packages for Creating Tables

There are many packages in R for creating formatted tables.

In this course, we will use the modelsummary package. It allows us to produce publication-quality tables from regression models, data frames, and summary statistics, and supports both HTML and LaTeX output. The same function can therefore be used for handouts, slides, and academic papers.

A very common general-purpose table tool in R is kable from the knitr package. The kableExtra package provides additional formatting options and is often used in combination with the pipe operator from the tidyverse. These tools are widely used in data science workflows and can produce visually attractive tables, though their default style is not tailored to the conventions typical in empirical economics papers.

Other popular alternatives include fixest, texreg, and gt. The fixest package includes the etable() function, which produces regression tables and integrates seamlessly with fixed effects, instrumental variables, and clustered standard errors; it is especially popular in applied econometrics. The texreg package is a lightweight tool focused specifically on regression tables and supports both LaTeX and HTML output, though it is somewhat less flexible for arbitrary data frames. The gt package is designed for highly customizable and visually polished tables, and is widely used in data science and reporting workflows, though it is not specifically oriented toward the formatting conventions typical in empirical economics papers. The appropriate choice depends on the context and the type of output required.

Read in the data

  • Use read.csv to read the data into a data.frame.
df_ <- read.csv("https://edward-vytlacil.github.io/Data/PROGRESA.csv", header=TRUE, sep=",") # read csv

To review data frames in R, see:

While we are using data frames, note that data.table provides an enhanced version of the data frame.

Instructions for reading data into R include:

Examine the data

  • Examine the data, making sure we understand the variables and data structure, and looking for any issues with the data.

  • Explore the data with R functions such as names, dim, head, table and summary.

You can use the help function to see documentation on a function. For example, help(names) gives you the help page for the function names.

Use the codebook readme.txt to understand the definition of each variable.

names(df_)  # name of each column (variable) in data frame df_
 [1] "wave"         "sooloca"      "indivill_seq" "villsize_t"   "progresa1"   
 [6] "sooind_id"    "sex1"         "age1"         "hgc1"         "poor1"       
[11] "school"      
dim(df_)    # number of rows (observations) and columns (variables) in data frame  
[1] 74031    11
head(df_)   # first 6 rows (observations) of data frame df_
  wave sooloca indivill_seq villsize_t progresa1 sooind_id sex1 age1 hgc1 poor1
1    1       1            1         76         1        18    0   10    1     1
2    1       1            2         76         1         5    0   11    2     1
3    1       1            3         76         1        30    0    9    2     1
4    1       1            4         76         1        42    1   14    3     1
5    1       1            5         76         1        14    1    7    0     1
6    1       1            6         76         1         8    0   12    2     1
  school
1      1
2      1
3      1
4      0
5      1
6      1
table(df_$wave)       # frequency counts for variable wave  

    1     2     3     4     5 
14996 12368 15455 15610 15602 
table(df_$progresa1)  # frequency counts for variable progresa (treatment status)  

    0     1 
27723 46308 
table(df_$hgc1)   # frequency counts for variable hgc1 (highest grade completed) 

    0     1     2     3     4     5     6     7     8     9 
 4669  8863  9685  9876  8926  8187 13296  3858  3049  3622 
summary(df_)      # descriptive statistics of all variables in data frame df_
      wave         sooloca     indivill_seq      villsize_t    
 Min.   :1.00   Min.   :  1   Min.   :  1.00   Min.   :  1.00  
 1st Qu.:2.00   1st Qu.:131   1st Qu.:  8.00   1st Qu.: 26.00  
 Median :3.00   Median :253   Median : 18.00   Median : 39.00  
 Mean   :3.06   Mean   :254   Mean   : 25.63   Mean   : 50.26  
 3rd Qu.:4.00   3rd Qu.:395   3rd Qu.: 34.00   3rd Qu.: 63.00  
 Max.   :5.00   Max.   :491   Max.   :210.00   Max.   :210.00  
   progresa1        sooind_id          sex1            age1      
 Min.   :0.0000   Min.   :    1   Min.   :0.000   Min.   : 0.00  
 1st Qu.:0.0000   1st Qu.: 3983   1st Qu.:0.000   1st Qu.: 9.00  
 Median :1.0000   Median : 7894   Median :0.000   Median :11.00  
 Mean   :0.6255   Mean   : 7877   Mean   :0.485   Mean   :11.33  
 3rd Qu.:1.0000   3rd Qu.:11768   3rd Qu.:1.000   3rd Qu.:14.00  
 Max.   :1.0000   Max.   :15669   Max.   :1.000   Max.   :99.00  
      hgc1           poor1       school      
 Min.   :0.000   Min.   :1   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:1   1st Qu.:1.0000  
 Median :4.000   Median :1   Median :1.0000  
 Mean   :4.029   Mean   :1   Mean   :0.8275  
 3rd Qu.:6.000   3rd Qu.:1   3rd Qu.:1.0000  
 Max.   :9.000   Max.   :1   Max.   :1.0000  

Notes:

  • The data has 74031 child-wave observations, not 74031 children, since each child can appear in the data up to 5 times.

  • The data has an uneven number of observations per wave; in other words, this is an unbalanced panel.

  • poor1==1 for all observations, so that all observations are poor enough to be eligible for PROGRESA. We can therefore drop the variable poor1.

  • The youngest child is 0 years old, while the oldest is 99 years old. With the code below, we find that, while the vast majority of rows have an age between 6 and 18, 1 row has a child less than 6 years old and 261 rows report the child being older than 18. Such values are likely to be data entry errors, so we will later code ages under 6 or over 18 as missing.

sum(df_$age1<6)    
# 1
sum(df_$age1>=6 & df_$age1<=18)
# 73769
sum(df_$age1>18)
# 261
sum(df_$age1==99)
# 10

A dummy variable is a variable that always equals \(0\) or \(1\). Dummy variables are also called indicator variables. We typically use them to represent TRUE/FALSE, YES/NO type data – for example, you are over age \(18\) or you are not.

The mean of a dummy variable is the fraction of observations for which the variable equals \(1\). Likewise, the sum of a dummy variable is the number of observations for which the variable equals \(1\).

The following code defines a new dummy variable for being over age 18, and computes the number of observations over age 18 and the fraction of observations over age 18.

df_$Dummy1 <- df_$age1>18 #Create dummy variable for age1>18
sum(df_$Dummy1) # number of observations with age1>18
mean(df_$Dummy1) # fraction of observations age1>18

The following code is equivalent, without the step of defining the dummy variable:

sum(df_$age1>18) # number of observations with age1>18
mean(df_$age1>18) # fraction of observations with age1>18

We could further investigate the observations with unusual ages. For example, we could inspect the one row with a reported age of \(0\), and see that the 0-year old child has already completed first grade, making it unlikely that the reported age of zero is correct:

df_[df_$age1==0,]
      wave sooloca indivill_seq villsize_t progresa1 sooind_id sex1 age1 hgc1
44394    4      44           10         14         0      1586    0    0    1
      poor1 school
44394     1      1

The observation’s child id is 1586 and we could inspect the other waves for the same child.

df_[df_$sooind_id==1586,]
      wave sooloca indivill_seq villsize_t progresa1 sooind_id sex1 age1 hgc1
1418     1      44            3         13         0      1586    0    7    1
28916    3      44           11         14         0      1586    0   99    1
44394    4      44           10         14         0      1586    0    0    1
60001    5      44            2         14         0      1586    0    7    1
      poor1 school
1418      1      1
28916     1      1
44394     1      1
60001     1      1

We see that the same child was coded as having the following ages:

  • Wave 1, age 7;
  • Wave 3, age 99;
  • Wave 4, age 0;
  • Wave 5, age 7

Clearly, the child’s age couldn’t jump from 7 to 99 between waves 1 and 3, nor fall from 99 to 0 between waves 3 and 4, so the ages of 0 and 99 are coded incorrectly. We could guess that the child is actually 7 years old in all waves (imputing the ages in waves 3 and 4 based on the ages in waves 1 and 5), but here we will just later code the 0 and 99 as missing.
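As an aside, the imputation alternative mentioned above can be sketched in base R. The following is a minimal illustration on a hypothetical four-row toy panel (not the PROGRESA data); impute_mode is our own made-up helper, and in the lab itself we simply recode implausible ages to NA.

```r
# Hypothetical mini panel: one child observed in 4 waves,
# with implausible ages (99 and 0) in waves 3 and 4.
toy <- data.frame(id   = c(1, 1, 1, 1),
                  wave = c(1, 3, 4, 5),
                  age1 = c(7, 99, 0, 7))

# Step 1: recode implausible ages (under 6 or over 18) to NA, as in the lab.
toy$age1 <- ifelse(toy$age1 >= 6 & toy$age1 <= 18, toy$age1, NA)

# Step 2 (the optional imputation): replace NA with the child's
# most common observed age across waves.
impute_mode <- function(x) {
  obs <- x[!is.na(x)]
  m <- as.numeric(names(which.max(table(obs))))  # modal observed value
  ifelse(is.na(x), m, x)
}
toy$age1 <- ave(toy$age1, toy$id, FUN = impute_mode)
toy$age1
# 7 7 7 7
```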

Number of Observations

The number of observations plays a critical role in statistical analysis, including our choice of research methodology. The more (independent) observations:

  • the more precise the estimates;

  • the better the approximations coming from the central limit theorem and other asymptotic results;

  • the higher the power of tests to reject false hypotheses, and the narrower our confidence intervals;

  • the more flexible are the models that we can feasibly fit to the data.

We have 74031 rows in our data frame. It is a panel data set in which each row is a child-wave observation, with each child appearing in the data frame up to \(5\) times. We should not think of the number of child-wave observations as the number of independent observations: for example, a given child being a boy in wave one is not independent of the same child being a boy in wave two. Independence across children may be a reasonable approximation, in which case we might take the number of children as our number of observations when thinking about statistical precision or the accuracy of asymptotic approximations. On the other hand, the experiment assigned treatment at the village level, and children within a village (certainly within the same family) are plausibly not independent; in that case, we might instead take the number of villages as the number of observations when considering statistical precision and asymptotic approximations.
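A small simulation (entirely synthetic data, with made-up village counts and variance parameters, not the PROGRESA sample) illustrates why the choice of unit matters: when outcomes share a village-level component, a standard error that treats every child row as independent is too small relative to one computed from village means.

```r
set.seed(1)
n_villages <- 50; kids_per_village <- 20
village <- rep(1:n_villages, each = kids_per_village)
# outcome = shared village shock + individual noise
y <- rnorm(n_villages)[village] + rnorm(n_villages * kids_per_village, sd = 0.5)

# Naive SE: pretends all child observations are independent
se_naive <- sd(y) / sqrt(length(y))

# Village-level SE: treats village means as the independent units
vm <- tapply(y, village, mean)
se_village <- sd(vm) / sqrt(n_villages)

se_naive < se_village   # TRUE: ignoring clustering overstates precision
```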

  • Recall
    • Child id is sooind_id.
    • Village id is sooloca.
  • Using nrow, we find that the data has
    • 74031 rows (child-wave observations)
  • Using length and unique, we find that the data has
    • 15669 children in the data, of which 9799 are treated and 5870 are control.
    • 491 villages, of which 308 are treated and 183 are control.
nrow(df_)  # Number of rows (child-wave observations)
# 74031
length(unique(df_$sooind_id)) #Number of children
# 15669
length(unique(subset(df_,df_$progresa1==1)$sooind_id)) #Number of treated children
# 9799
length(unique(subset(df_,df_$progresa1==0)$sooind_id)) #Number of control children
# 5870
length(unique(df_$sooloca))   #Number of villages
# 491
length(unique(subset(df_,df_$progresa1==1)$sooloca)) #Number of treated villages
# 308
length(unique(subset(df_,df_$progresa1==0)$sooloca)) #Number of control villages
# 183
  • unique is a function whose argument is a vector (or data.frame) and which returns the unique values of that vector. For example:
a<- c(1,1,2,2,3,4,5) 
unique(a)  
#  1 2 3 4 5
b<- c("Ed","Joe","Ed","Lisa","Joe","Karen")
unique(b)  
# "Ed"    "Joe"   "Lisa"  "Karen"
  • length is a function whose argument is a vector and which returns the length of that vector. For example:
length(a)  
# 7
length(b)
# 6
a.0 <- unique(a)
b.0 <- unique(b)
length(a.0)
# 5
length(b.0)
# 4

In R one can nest functions. For example, length(unique(a)) is a nested function call, and should be interpreted from the innermost function (innermost parentheses) outward: first the function unique is applied to the vector a, finding the unique elements of a, and then length finds the length of that vector, and thus the number of unique elements of a. The following code is equivalent:

length(unique(c(1,2,1,2,3,4,4,5)))
# 5
a<- c(1,2,1,2,3,4,4,5) 
a.0 <- unique(a)
length(a.0)
# 5

Likewise, the code length(unique(subset(df_,df_$progresa1==0)$sooind_id)) has nested function calls, and should be interpreted from the innermost function (innermost parentheses) outward. In particular:

  • subset(df_,df_$progresa1==0) is taking the subset of the data frame df_ that has the variable progresa1==0 (untreated observations). subset(df_,df_$progresa1==0)$sooind_id is taking the column (variable) sooind_id (child id) from the resulting data frame.
  • unique(subset(df_,df_$progresa1==0)$sooind_id) is taking the unique elements of the vector subset(df_,df_$progresa1==0)$sooind_id, in other words, the unique child id numbers from the data frame of control observations.
  • length(unique(subset(df_,df_$progresa1==0)$sooind_id)) is taking the length of the vector unique(subset(df_,df_$progresa1==0)$sooind_id), in other words, how many unique child id numbers are there for control children.

We can use nested functions, which generally results in more concise code, though the code can potentially be confusing. The following code is equivalent:

length(unique(subset(df_,df_$progresa1==0)$sooind_id))
#5870
c.0 <- subset(df_,df_$progresa1==0)$sooind_id
c.1 <- unique(c.0)
length(c.1)
#5870

Preparing data for analysis

Before we actually start analyzing the data, we will prepare it for analysis. In particular, we will complete the following steps:

  • Change the name of two variables for convenience:

    • progresa1 to treat,
    • sex1 to girl.
  • Using the subset command, restrict the data to the first wave (the first baseline wave) and the fourth wave (the second post-treatment wave), and keep only the relevant variables.

  • Using the ifelse command, code as missing any age less than 6 or greater than 18.

  • Following Schultz (2004), create a balanced sample, additionally restricting to children aged 6 to 16 in the first wave.

  • Using the ifelse command, create an indicator variable for whether an observation is from the post period, i.e., a variable that equals 1 if the observation is from wave 4.

  • Create separate data frames for pre- and post-treatment observations.

  • Create a baseline grade variable for later analysis of heterogeneous effects by baseline grade.

df_$treat <- df_$progresa1  
df_$girl <- df_$sex1  
df<-subset(df_, wave==4 | wave==1,
           select= c(sooloca,sooind_id,age1,hgc1,school,treat,girl,wave))
rm(df_)
df$age1 <- ifelse(df$age1>=6 & df$age1<=18, df$age1, NA) # setting implausible ages to missing
balanced_ids <- intersect(df[df$wave==4,"sooind_id"],
                  df[df$wave==1 & df$age1>=6 & df$age1<=16,"sooind_id"])
df <- df[df$sooind_id %in% balanced_ids,]
df$post <- ifelse(df$wave==4, 1, 0)  # creating dummy variable for post period
dfPre <- df[df$post==0,]  # creating new data.frame with only pre treatment observations
dfPost <- df[df$post==1,] # creating new data.frame with only post treatment observations

# Create baseline HGC (wave 1) at child level
base_hgc <- df[df$wave == 1, c("sooind_id", "hgc1")]
names(base_hgc)[names(base_hgc) == "hgc1"] <- "hgc0"

# Merge into df/dfPre/dfPost so every row carries the child's baseline grade
df   <- merge(df,   base_hgc, by = "sooind_id", all.x = TRUE, sort = FALSE)
dfPre <- merge(dfPre, base_hgc, by = "sooind_id", all.x = TRUE, sort = FALSE)
dfPost<- merge(dfPost,base_hgc, by = "sooind_id", all.x = TRUE, sort = FALSE)
  • This data is already fairly “clean”. However, data often needs more extensive preparation for analysis, sometimes called “data wrangling”.

  • For methods to help with more involved “data wrangling” in R, see either:

    • R Workflow by Frank Harrell, approach based on HMISC.
    • R for Data Science by Hadley Wickham and Garret Grolemund, approach based on tidyverse.

Balance at Baseline: Summary Table

If the randomization is done correctly, then any differences at baseline (pre-treatment) between those assigned to treatment vs. control are due purely to chance. However, by pure chance, random assignment can still result in differences between those assigned to treatment vs. control.

For this reason, it is common in experimental papers to produce a table examining whether, on average, before receipt of treatment, those assigned to treatment have similar values of the covariates as those assigned to control. Such tables report mean baseline covariates separately for those assigned to treatment and those assigned to control. They often also include a standardized difference: the difference between the two group means divided by the pooled estimated standard deviation. They often also include t-tests of the null of equal means, though such hypothesis testing in this context is hard to justify (see Bruhn and McKenzie 2009, Section G).
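As a minimal sketch of the standardized-difference calculation (using made-up vectors, not the lab data):

```r
# Hypothetical baseline covariate x for control (d = 0) and treated (d = 1)
x <- c(10, 11, 9, 12, 10, 13, 11, 12)
d <- c( 0,  0, 0,  0,  1,  1,  1,  1)

m0 <- mean(x[d == 0]); m1 <- mean(x[d == 1])   # group means
v0 <- var(x[d == 0]);  v1 <- var(x[d == 1])    # group variances
n0 <- sum(d == 0);     n1 <- sum(d == 1)       # group sizes

# Pooled standard deviation, as in the balance table below
pooled_sd <- sqrt(((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2))
std_diff  <- (m1 - m0) / pooled_sd
std_diff
# 0.7745967
```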

Researchers sometimes re-randomize treatment assignment if the baseline is not sufficiently balanced. Doing so typically does improve precision, similar to proper linear regression adjustment for baseline characteristics (Li, Ding, and Rubin 2018; Lin 2013; Negi and Wooldridge 2021). However, it invalidates conventional standard errors and inference procedures, and is inferior to other methods of designing the experiment or of adjusting for differences at baseline. See, e.g., Bai (2022).

Create a table reporting the mean of baseline characteristics separately by treatment assignment, the difference in those means, and the standardized difference (dividing the difference by the pooled estimated standard deviation).

Balance at Baseline
Variable        Control  Treated  Diff.  Std Diff
Girl               0.49     0.48  −0.01     −0.03
Age               10.35    10.33  −0.02     −0.01
Highest Grade      3.41     3.44   0.03      0.01
Enrolled           0.86     0.86   0.00      0.00
return_mean_by_treatment <- function(x){    
  means.t<-tapply(x,dfPre$treat,mean)
  var.t<-tapply(x,dfPre$treat,var)
  return(c(means.t,var.t))
  }

vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)

means <- output[1:2,]
diffs <- output[2,]-output[1,]
N1 <- sum(dfPre$treat)
N0 <- sum(1-dfPre$treat)
pooled.sd <- sqrt(((N0-1)*output[3,]+(N1-1)*output[4,])/(N0+N1-2))
std.diffs <- (output[2,]-output[1,])/pooled.sd 
results0 <- t(rbind(means, diffs, std.diffs))

colnames(results0) <- c("Control", "Treated", "Diff.", "Std Diff")

varlabels <- c("Girl", "Age", "Highest Grade", "Enrolled")

# Convert the matrix `results0` into a data frame so it can be printed as a table.
results0_df <- data.frame(
  Variable = varlabels,
  results0,
  row.names = NULL,
  check.names = FALSE # Prevent R from automatically modifying column names.
)

# Print the table using modelsummary's datasummary_df() function.
modelsummary::datasummary_df(
  results0_df,
  output = "html",
  title  = "Balance at Baseline",
  fmt    = 2 # Format numeric entries to 2 decimal places 
)

tapply allows you to apply a function separately by groups. The general form is tapply(X,INDEX,FUN), which applies the function FUN to the vector X separately by levels of INDEX. For example, tapply(dfPre$age1,dfPre$treat,mean) applies the function mean to the vector dfPre$age1 by levels of dfPre$treat (treated vs control). Using that code, we find that the mean age of controls is 10.35 and the mean age of treated is 10.33.

tapply(dfPre$age1,dfPre$treat,mean)  # mean age by treatment status
       0        1 
10.35253 10.32957 

sapply allows you to apply a function to each element of a list, vector, or data.frame. The general form is sapply(X,FUN), which applies the function FUN to each element of X. For example, sapply(dfPre,mean) applies the function mean to each column (each variable) in the data frame dfPre. In the following code, we first define a vector of variable names vars, so that dfPre[vars] is a data frame containing only the variables in vars, and then sapply(dfPre[vars],mean) applies the function mean to each of those variables.

vars <- c("girl","age1","hgc1","school")
sapply(dfPre[vars],mean)  # mean of each column in dfPre[vars]
      girl       age1       hgc1     school 
 0.4838624 10.3381568  3.4305815  0.8567618 

Using tapply and sapply can help you create efficient code and save you time.

For more on tapply and sapply, see Matloff (2022) Lesson 8 and Lesson 28.

For more on user defined functions see Grolemund (2014) Sections 1.4 and 1.5, or Matloff (2022) Lessons 16 and 18.

You can write your own functions in R. Objects of class “function” in R have the form:

# defining function
function.name <- function(<arguments>){
  # Function body, code taking arguments as inputs
}

# calling function
function.name(<arguments>)

The last value computed in the function body is the value returned by the function. For example, the following function adds one to its argument:

f.addone <- function(x){
  x+1
}

f.addone(1)
# 2
f.addone(10)
# 11

Defining functions can help make our code more efficient. Consider:

return_mean_by_treatment <- function(x){    
  means.t<-tapply(x,dfPre$treat,mean)
  var.t<-tapply(x,dfPre$treat,var)
  return(c(means.t,var.t))
  }

vars <- c("girl","age1","hgc1","school")
output <- sapply(dfPre[vars],return_mean_by_treatment)

The above code defines a function named return_mean_by_treatment, which takes a vector as its argument. It performs the following steps:

  1. calculates the mean of the vector separately by treatment status;
  2. calculates the variance of the vector separately by treatment status;
  3. returns the means and variances calculated in steps (1) and (2).

Using sapply, the code then applies that function to each column in dfPre[vars].


Balance at Baseline: School Enrollment at Each Grade

Consider balance in school enrollment at each grade level. In particular, create a figure showing fraction enrolled in school at each grade by treatment assignment. Create the figure separately for boys and girls.

To learn more about ggplot, see http://r-statistics.co/ggplot2-Tutorial-With-R.html.

MeanSchC.s  <-  with(subset(dfPre,  treat==0),tapply(school, list(hgc1,girl), mean))
MeanSchT.s <-  with(subset(dfPre,  treat==1),tapply(school, list(hgc1,girl), mean))

Grade <- as.factor(rep(0:9, 2))  # hgc1 takes values 0 through 9
Group <- c(rep(" Control", 10), rep("Treated", 10))
MeanSch.b <- matrix(c(MeanSchC.s[,1],MeanSchT.s[,1]), nrow = 20, ncol = 1)
MeanSch.g <- matrix(c(MeanSchC.s[,2],MeanSchT.s[,2]), nrow = 20, ncol = 1)

tab.b <- data.frame(Grade, Group, MeanSch.b)
tab.g <- data.frame(Grade, Group, MeanSch.g)


plotPre.b <- ggplot(tab.b, aes(x = Grade, y = MeanSch.b, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
  xlab("Grade Level") +
  ylab("Mean School Enrollment")+
  ggtitle("Boys: School Enrollment by Grade, Pre Period")

plotPre.g <- ggplot(tab.g, aes(x = Grade, y = MeanSch.g, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width=0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(from = 0, to = 1, by = 0.1)) +
  xlab("Grade Level") +
  ylab("Mean School Enrollment")+
  ggtitle("Girls: School Enrollment by Grade, Pre Period")
ggplotly(plotPre.b,tooltip="y")
ggplotly(plotPre.g,tooltip="y")

Using Post Data to Estimate Effect

Now, use mean differences on post-treatment data to estimate the effect of PROGRESA on school enrollment. By random assignment, the mean difference estimator does not suffer from selection bias. Furthermore, we can take the mean difference conditional on any covariate that is not affected by the treatment, including any baseline characteristic, to estimate a conditional average treatment effect.

  • Overall: 0.041,

  • By sex:

    • For boys: 0.022,

    • For girls: 0.060.

mean(dfPost[ dfPost$treat==1,"school"]) -  mean(dfPost[ dfPost$treat==0,"school"])
# 0.04063189
mean(dfPost[dfPost$treat==1&dfPost$girl==0,"school"])-mean(dfPost[dfPost$treat==0&dfPost$girl==0,"school"])
# 0.02215276
mean(dfPost[dfPost$treat==1&dfPost$girl==1,"school"])- mean(dfPost[dfPost$treat==0&dfPost$girl==1,"school"])
# 0.05975947

Using Post Data to Estimate Effect by Baseline Grade Level

Now, create a figure examining the effect of PROGRESA on enrollment by baseline grade level, separately for boys and girls.

MeansC0 <- with(dfPost, tapply(school, list(treat, girl, hgc0), mean))

dimnames(MeansC0)[[1]] <- c("Control", "Treated")
dimnames(MeansC0)[[2]] <- c("Boys", "Girls")

EffSchBoy  <- MeansC0[2,1,] - MeansC0[1,1,]
EffSchGirl <- MeansC0[2,2,] - MeansC0[1,2,]

# Use the baseline-grade support for x-axis
grade_vals <- as.numeric(dimnames(MeansC0)[[3]])  # typically 0:9
Grade <- as.factor(rep(grade_vals, 2))

EffSch <- c(EffSchBoy, EffSchGirl)
sex <- c(rep("Boy", length(grade_vals)), rep("Girl", length(grade_vals)))

tabEff <- data.frame(Grade, sex, EffSch)

plotEff <- ggplot(tabEff, aes(x = Grade, y = EffSch, fill = sex)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  theme_bw(base_size = 10) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(breaks = seq(from = -0.1, to = 0.1, by = 0.05)) +
  xlab("Baseline highest grade completed (wave 1)") +
  ylab("Mean school enrollment effect (wave 4)") +
  ggtitle("Estimated Treatment Effects on Enrollment by Baseline Grade and Sex")

ggplotly(plotEff, tooltip = "y")

Some results are surprising, e.g. the estimated effect on boys in first grade is 0.008. Presumably, some of these results are driven by random sampling noise. The level of random sampling noise depends on the number of observations (and the level of dependence/independence across observations). With i.i.d. data,\[\mbox{Var}(\bar{X}_N)=\frac{\sigma^2}{N},\] where \(\sigma^2 = \mbox{Var}(X_i)\). Estimating a mean difference on a small subgroup will have more sampling noise than an estimate on the full sample – we expect better precision for the estimate for the full sample than in an estimate for girls entering grade \(2\). In addition, estimating results separately by many subgroups can give rise to spurious results due to data mining. We will return to these issues.
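The variance formula above can be illustrated with a short simulation (synthetic draws, not the lab sample): the sample mean computed on a small subgroup fluctuates far more across repeated samples than the mean computed on the full sample, with standard deviations scaling roughly as \(1/\sqrt{N}\).

```r
set.seed(42)
reps <- 2000

# Simulate the sample mean of N i.i.d. standard normal draws, many times over
sim_mean <- function(N) replicate(reps, mean(rnorm(N)))

sd_full  <- sd(sim_mean(1000))  # "full sample", N = 1000
sd_small <- sd(sim_mean(50))    # "small subgroup", N = 50

# Var(X_bar) = sigma^2 / N, so the SD ratio should be near sqrt(1000/50)
sd_small / sd_full
```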

No. of Obs. by Grade-Sex-Treatment

Examine:

  1. The number of boys in each grade by treatment status;
  2. The number of girls in each grade by treatment status;
  3. The number of villages with boys in each grade by treatment status;
  4. The number of villages with girls in each grade by treatment status.

These counts should give us some idea of how much sampling noise to expect for each estimated treatment effect by sex and grade.
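As a back-of-the-envelope guide to why the counts matter: for a binary outcome with rate \(p\) and independent observations, the standard error of a treated-minus-control difference in means is roughly \(\sqrt{p(1-p)(1/n_T + 1/n_C)}\). The enrollment rate and cell sizes below are hypothetical:

```r
# Approximate standard error of a difference in means for a binary outcome,
# assuming independent observations and a common rate p (hypothetical values).
se_diff <- function(n_t, n_c, p = 0.9) {
  sqrt(p * (1 - p) * (1 / n_t + 1 / n_c))
}

se_diff(1000, 700)  # a well-populated grade cell
se_diff(60, 40)     # a sparse cell: several times noisier
```

Independence across observations is itself an assumption here; dependence within villages would inflate these standard errors further, which is part of why the village counts below are also worth examining.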

person.counts <- with(dfPost, tapply(sooind_id, list(treat, girl, hgc0),
                                     function(x) length(unique(x))))
dimnames(person.counts)[[1]] <- c("Control", "Treated")
dimnames(person.counts)[[2]] <- c("Boys", "Girls")

# Derive the grade support from the array, as above, rather than hard-coding 1:10
grade_vals <- as.numeric(dimnames(person.counts)[[3]])
Grade <- as.factor(rep(grade_vals, 2))
Group <- c(rep("Control", length(grade_vals)), rep("Treated", length(grade_vals)))
numbers.boys  <- c(person.counts[1, 1, ], person.counts[2, 1, ])
numbers.girls <- c(person.counts[1, 2, ], person.counts[2, 2, ])

tab.boys  <- data.frame(Grade, Group, numbers.boys)
tab.girls <- data.frame(Grade, Group, numbers.girls)

plot.n.boys <- ggplot(tab.boys, aes(x = Grade, y = numbers.boys, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(limits = c(0, 1100), breaks = seq(from = 0, to = 1100, by = 100)) +
  xlab("Baseline Grade Level") +
  ylab("Number of Boy Observations") +
  ggtitle("Number of Boy Observations, Treated and Control")

plot.n.girls <- ggplot(tab.girls, aes(x = Grade, y = numbers.girls, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_y_continuous(limits = c(0, 1100), breaks = seq(from = 0, to = 1100, by = 100)) +
  xlab("Baseline Grade Level") +
  ylab("Number of Girl Observations") +
  ggtitle("Number of Girl Observations, Treated and Control")

ggplotly(plot.n.boys, tooltip = "y")
ggplotly(plot.n.girls, tooltip = "y")

  
village.counts <- with(dfPost, tapply(sooloca, list(treat, girl, hgc0),
                                      function(x) length(unique(x))))
dimnames(village.counts)[[1]] <- c("Control", "Treated")
dimnames(village.counts)[[2]] <- c("Boys", "Girls")

numbers.boys.v  <- c(village.counts[1, 1, ], village.counts[2, 1, ])
numbers.girls.v <- c(village.counts[1, 2, ], village.counts[2, 2, ])

tab.boys.v  <- data.frame(Grade, Group, numbers.boys.v)
tab.girls.v <- data.frame(Grade, Group, numbers.girls.v)

plot.n.boys.v <- ggplot(tab.boys.v, aes(x = Grade, y = numbers.boys.v, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  # The 0-1100 axis used for the person counts would dwarf the village counts,
  # so let ggplot choose the y-axis scale here.
  xlab("Baseline Grade Level") +
  ylab("Number of Villages with Boy Observations") +
  ggtitle("Number of Villages with Boy Observations")

plot.n.girls.v <- ggplot(tab.girls.v, aes(x = Grade, y = numbers.girls.v, fill = Group)) +
  geom_col(width = 0.7, position = position_dodge(width = 0.8)) +
  theme_bw(base_size = 11) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  xlab("Baseline Grade Level") +
  ylab("Number of Villages with Girl Observations") +
  ggtitle("Number of Villages with Girl Observations")

ggplotly(plot.n.boys.v, tooltip = "y")
ggplotly(plot.n.girls.v, tooltip = "y")

Conclusion

Overall, we find positive estimated effects of PROGRESA on school enrollment, with a larger estimated effect for girls than for boys. The estimated effects vary widely across sex-grade cells, and the most extreme and unexpected results occur in the cells with the fewest observations, where we expect the most sampling noise.

The next natural step is to conduct formal inference, for example, to test the null hypothesis of no treatment effect including for subgroups. We will return to hypothesis testing next time.
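As a rough preview of that next step, a two-sample test of the null of no treatment effect might look as follows. The data here are simulated with hypothetical enrollment rates and sample sizes, not taken from dfPost, and village-level clustering is ignored:

```r
# Simulated example of testing H0: no treatment effect on enrollment.
# Rates and sample sizes are hypothetical.
set.seed(1)
enroll_treat   <- rbinom(800, 1, 0.92)
enroll_control <- rbinom(500, 1, 0.88)

t.test(enroll_treat, enroll_control)  # Welch two-sample t-test of equal means
```

With clustered treatment assignment at the village level, standard errors would need to account for within-village dependence, one of the issues we take up next time.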

References

Arel-Bundock, Vincent. 2022. “Modelsummary: Data and Model Summaries in R.” Journal of Statistical Software 103: 1–23.
Attanasio, Orazio, Emla Fitzsimons, Ana Gomez, Martha Isabel Gutierrez, Costas Meghir, and Alice Mesnard. 2010. “Children’s Schooling and Work in the Presence of a Conditional Cash Transfer Program in Rural Colombia.” Economic Development and Cultural Change 58 (2): 181–210.
Bai, Yuehao. 2022. “Optimality of Matched-Pair Designs in Randomized Controlled Trials.” arXiv Preprint arXiv:2206.07845.
Bruhn, Miriam, and David McKenzie. 2009. “In Pursuit of Balance: Randomization in Practice in Development Field Experiments.” American Economic Journal: Applied Economics 1 (4): 200–232. https://www.jstor.org/stable/25760187.
Grolemund, Garrett. 2014. Hands-On Programming with R: Write Your Own Functions and Simulations. O’Reilly Media.
International Food Policy Research Institute (IFPRI). 2018. “Mexico, Evaluation of PROGRESA.” Harvard Dataverse. https://doi.org/10.7910/DVN/05BMJY.
Lee, Soohyung, and Azeem M Shaikh. 2014. “Multiple Testing and Heterogeneous Treatment Effects: Re-Evaluating the Effect of Progresa on School Enrollment.” Journal of Applied Econometrics 29 (4): 612–26. https://home.uchicago.edu/~amshaikh/webfiles/progresa.pdf.
Li, Xinran, Peng Ding, and Donald B Rubin. 2018. “Asymptotic Theory of Rerandomization in Treatment–Control Experiments.” Proceedings of the National Academy of Sciences 115 (37): 9157–62.
Lin, Winston. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1): 295–318.
Matloff, Norm. 2022. fasteR: Fast Lane to Learning R! https://github.com/matloff/fasteR#faster-fast-lane-to-learning-r.
Negi, Akanksha, and Jeffrey M Wooldridge. 2021. “Revisiting Regression Adjustment in Experiments with Heterogeneous Treatment Effects.” Econometric Reviews 40 (5): 504–34.
Schultz, T Paul. 2004. “School Subsidies for the Poor: Evaluating the Mexican Progresa Poverty Program.” Journal of Development Economics 74 (1): 199–250.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman & Hall/CRC. https://plotly-r.com.
Skoufias, Emmanuel, Susan W Parker, Jere R Behrman, and Carola Pessino. 2001. “Conditional Cash Transfers and Their Impact on Child Work and Schooling: Evidence from the Progresa Program in Mexico [with Comments].” Economia 2 (1): 45–96. https://www.jstor.org/stable/20065413.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.